Introduction

This dataset contains loan information for approx. 114,000 loans. This analysis is designed to answer meaningful questions about the loan data in question and perform visualizations and aggregations in order to answer those questions. The analysis is as follow:

Load in required libraries

library(tidyverse)
library(GGally)

Load in Data

loan_df <- read.csv('../data/prosperLoanData.csv')

Display Subset of dataframe

head(loan_df[,2:5], n = 4)
##   ListingNumber           ListingCreationDate CreditGrade Term
## 1        193129 2007-08-26 19:09:29.263000000           C   36
## 2       1209647 2014-02-27 08:28:07.900000000               36
## 3         81716 2007-01-05 15:00:47.090000000          HR   36
## 4        658116 2012-10-22 11:02:35.010000000               36

Select Variables that will be Examined in the Analysis and Subset DataFrame

analysis_vars <- c(
    "ListingKey", "EmploymentStatus", "EstimatedLoss",  
    "BorrowerAPR", "LoanStatus", "MonthlyLoanPayment",
    "LoanOriginalAmount", "LoanCurrentDaysDelinquent", "IncomeRange",
    "DebtToIncomeRatio", "AmountDelinquent", "EstimatedReturn"
)
analysis_df <- select(loan_df, analysis_vars)

Get Summary for All the Variables

summary(analysis_df)
##                    ListingKey          EmploymentStatus EstimatedLoss  
##  17A93590655669644DB4C06:     6   Employed     :67322   Min.   :0.005  
##  349D3587495831350F0F648:     4   Full-time    :26355   1st Qu.:0.042  
##  47C1359638497431975670B:     4   Self-employed: 6134   Median :0.072  
##  8474358854651984137201C:     4   Not available: 5347   Mean   :0.080  
##  DE8535960513435199406CE:     4   Other        : 3806   3rd Qu.:0.112  
##  04C13599434217079754AEE:     3                : 2255   Max.   :0.366  
##  (Other)                :113912   (Other)      : 2718   NA's   :29084  
##   BorrowerAPR                      LoanStatus    MonthlyLoanPayment
##  Min.   :0.00653   Current              :56576   Min.   :   0.0    
##  1st Qu.:0.15629   Completed            :38074   1st Qu.: 131.6    
##  Median :0.20976   Chargedoff           :11992   Median : 217.7    
##  Mean   :0.21883   Defaulted            : 5018   Mean   : 272.5    
##  3rd Qu.:0.28381   Past Due (1-15 days) :  806   3rd Qu.: 371.6    
##  Max.   :0.51229   Past Due (31-60 days):  363   Max.   :2251.5    
##  NA's   :25        (Other)              : 1108                     
##  LoanOriginalAmount LoanCurrentDaysDelinquent         IncomeRange   
##  Min.   : 1000      Min.   :   0.0            $25,000-49,999:32192  
##  1st Qu.: 4000      1st Qu.:   0.0            $50,000-74,999:31050  
##  Median : 6500      Median :   0.0            $100,000+     :17337  
##  Mean   : 8337      Mean   : 152.8            $75,000-99,999:16916  
##  3rd Qu.:12000      3rd Qu.:   0.0            Not displayed : 7741  
##  Max.   :35000      Max.   :2704.0            $1-24,999     : 7274  
##                                               (Other)       : 1427  
##  DebtToIncomeRatio AmountDelinquent   EstimatedReturn 
##  Min.   : 0.000    Min.   :     0.0   Min.   :-0.183  
##  1st Qu.: 0.140    1st Qu.:     0.0   1st Qu.: 0.074  
##  Median : 0.220    Median :     0.0   Median : 0.092  
##  Mean   : 0.276    Mean   :   984.5   Mean   : 0.096  
##  3rd Qu.: 0.320    3rd Qu.:     0.0   3rd Qu.: 0.117  
##  Max.   :10.010    Max.   :463881.0   Max.   : 0.284  
##  NA's   :8554      NA's   :7622       NA's   :29084

Show Histograms of All Continuous Variables to be Analyzed

hist_df <- Filter(is.numeric, analysis_df)

ggplot(gather(hist_df), aes(value)) +
    geom_histogram() +
    facet_wrap(~ key, scales = 'free_x')

Most distributions are right-skewed with the exceptions of Estimated Return and Borrower APR. There are also are large amount of 0 values present in Amount Delinuqent and Loan Current Days Delinquent.

Find which Variables are the Most Correlated by Using a Scatter Matrix

ggpairs(hist_df)

Most variables don’t have a strong positive or negative correlation however Borrower APR & Estimated Loss, and Loan Original Amount and Monthly loan payment have a nearly 1 to 1 correlation while Estimated Return and Borrower APR has a strong positive correlation and Estimated Return Estimated loss has a moderate correlation.

Data Analysis Questions

Which Employment Category Has the Highest Monthly Payment?

ggplot(analysis_df %>% filter(!is.na(MonthlyLoanPayment)), aes(x = EmploymentStatus, y = MonthlyLoanPayment)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

analysis_df %>%
    select(MonthlyLoanPayment, EmploymentStatus) %>%
    group_by(EmploymentStatus) %>%
    summarize(
        max = max(MonthlyLoanPayment), med = median(MonthlyLoanPayment),
        avg = mean(MonthlyLoanPayment)
    )
## # A tibble: 9 x 4
##   EmploymentStatus   max   med   avg
##   <fct>            <dbl> <dbl> <dbl>
## 1 ""               1041.  116.  159.
## 2 Employed         2252.  276.  310.
## 3 Full-time        1563.  163.  217.
## 4 Not available    1048.  126.  191.
## 5 Not employed     1086.  169.  183.
## 6 Other            1035.  173.  232.
## 7 Part-time         984.  106.  141.
## 8 Retired          1046.  131.  168.
## 9 Self-employed    1380.  238.  283.

The employment category that has the highest average and maximum monthly payment is Employed.

Which Income Range has the highest Delinquent Amount?

ggplot(analysis_df %>% filter(!is.na(IncomeRange)) , aes(x = IncomeRange, y = AmountDelinquent)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

analysis_df %>%
    select(IncomeRange, AmountDelinquent) %>%
    filter(!is.na(AmountDelinquent)) %>%
    group_by(IncomeRange) %>%
    summarize(
        max = max(AmountDelinquent), med = median(AmountDelinquent),
        avg = mean(AmountDelinquent)
    )
## # A tibble: 8 x 4
##   IncomeRange       max   med   avg
##   <fct>           <dbl> <dbl> <dbl>
## 1 $0              33134     0  473.
## 2 $1-24,999      160039     0  658.
## 3 $100,000+      463881     0 1242.
## 4 $25,000-49,999 444745     0  897.
## 5 $50,000-74,999 284169     0  909.
## 6 $75,000-99,999 265084     0 1204.
## 7 Not displayed   10574     0  419.
## 8 Not employed    91554     0  657.

The largest delinquent amount in terms of both average and max is 100K+ which is surprising though creating a percentage of delinquency amount using income range may tell a different story.

What loan status has the highest debt to income ratio?

ggplot(analysis_df %>% filter(!is.na(DebtToIncomeRatio)), aes(DebtToIncomeRatio)) +
    geom_histogram() +
    facet_wrap(~ LoanStatus, scales = 'free')

analysis_df %>%
    select(LoanStatus, DebtToIncomeRatio) %>%
    filter(!is.na(DebtToIncomeRatio)) %>%
    group_by(LoanStatus) %>%
    summarize(
        max = max(DebtToIncomeRatio), med = median(DebtToIncomeRatio),
        avg = mean(DebtToIncomeRatio)
    )
## # A tibble: 12 x 4
##    LoanStatus                max   med   avg
##    <fct>                   <dbl> <dbl> <dbl>
##  1 Cancelled               0.34  0.152 0.165
##  2 Chargedoff             10.0   0.21  0.339
##  3 Completed              10.0   0.19  0.264
##  4 Current                10.0   0.23  0.262
##  5 Defaulted              10.0   0.22  0.369
##  6 FinalPaymentInProgress  0.570 0.2   0.227
##  7 Past Due (>120 days)    0.62  0.225 0.278
##  8 Past Due (1-15 days)    6.49  0.23  0.265
##  9 Past Due (16-30 days)  10.0   0.25  0.339
## 10 Past Due (31-60 days)   1.19  0.22  0.256
## 11 Past Due (61-90 days)  10.0   0.22  0.376
## 12 Past Due (91-120 days)  3.17  0.24  0.292

The loan status with the highest average debt to income ratio is Past Due at 0.376. Several categories are tied with the same max value (10.01) which is consistent with the description of the max capped value for the Debt to Income Ratio variable. These loan status are Charged Off, Completed, Current, Defaulted, and the previously mentioned past due.

Which Employment status has the highest estimated return?

ggplot(analysis_df %>% filter(!is.na(EstimatedReturn)), aes(x = EmploymentStatus, y = EstimatedReturn)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

analysis_df %>%
    select(EmploymentStatus, EstimatedReturn) %>%
    filter(!is.na(EstimatedReturn)) %>%
    group_by(EmploymentStatus) %>%
    summarize(
        max = max(EstimatedReturn), med = median(EstimatedReturn),
        avg = mean(EstimatedReturn)
    )
## # A tibble: 7 x 4
##   EmploymentStatus   max    med    avg
##   <fct>            <dbl>  <dbl>  <dbl>
## 1 Employed         0.224 0.0905 0.0947
## 2 Full-time        0.267 0.102  0.103 
## 3 Not employed     0.226 0.122  0.119 
## 4 Other            0.176 0.099  0.0991
## 5 Part-time        0.184 0.110  0.106 
## 6 Retired          0.203 0.112  0.106 
## 7 Self-employed    0.284 0.0936 0.0973

The employment status with the highest max estimated returned value is Self-Employed with 0.28. However the highest average value is not-employed. This is also true for median. This situation is due to outliers in the self-employed category, which are more extreme on the positive side than not-employed, even though not-employed’s category’s median and average are larger.

Which loan status has the highest Estimated Loss?

ggplot(analysis_df %>% filter(!is.na(EstimatedLoss)), aes(x = LoanStatus, y = EstimatedLoss)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

analysis_df %>%
    select(LoanStatus, EstimatedLoss) %>%
    filter(!is.na(EstimatedLoss)) %>%
    group_by(LoanStatus) %>%
    summarize(
        max = max(EstimatedLoss), med = median(EstimatedLoss),
        avg = mean(EstimatedLoss)
    )
## # A tibble: 11 x 4
##    LoanStatus               max    med    avg
##    <fct>                  <dbl>  <dbl>  <dbl>
##  1 Chargedoff             0.366 0.112  0.116 
##  2 Completed              0.366 0.085  0.0868
##  3 Current                0.203 0.0649 0.0734
##  4 Defaulted              0.366 0.112  0.112 
##  5 FinalPaymentInProgress 0.183 0.0774 0.0833
##  6 Past Due (>120 days)   0.165 0.106  0.109 
##  7 Past Due (1-15 days)   0.183 0.0975 0.0980
##  8 Past Due (16-30 days)  0.195 0.0975 0.101 
##  9 Past Due (31-60 days)  0.203 0.099  0.0996
## 10 Past Due (61-90 days)  0.183 0.0975 0.103 
## 11 Past Due (91-120 days) 0.168 0.102  0.101

The loan statuses that have the maximum estimated loass are Charged Off and Completed at 0.366. But Charged Off has a higher median and average. This again due to the effect of outliers.

Which Income Range has the highest estimated return

ggplot(analysis_df %>% filter(!is.na(EstimatedReturn)), aes(x = IncomeRange, y = EstimatedReturn)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

analysis_df %>%
    select(IncomeRange, EstimatedReturn) %>%
    filter(!is.na(EstimatedReturn)) %>%
    group_by(IncomeRange) %>%
    summarize(
        max = max(EstimatedReturn), med = median(EstimatedReturn),
        avg = mean(EstimatedReturn)
    )
## # A tibble: 7 x 4
##   IncomeRange      max    med    avg
##   <fct>          <dbl>  <dbl>  <dbl>
## 1 $0             0.170 0.124  0.116 
## 2 $1-24,999      0.226 0.112  0.109 
## 3 $100,000+      0.267 0.0827 0.0879
## 4 $25,000-49,999 0.257 0.101  0.102 
## 5 $50,000-74,999 0.284 0.0903 0.0948
## 6 $75,000-99,999 0.257 0.0872 0.0920
## 7 Not employed   0.226 0.122  0.119

The income range with the maximum estimated return is the 50K to 74.9K range. However the income range with highest median value is 0 dollars and the highest average value is Not employed. This could be due to non-employed people marking their income as 0 dollars, but more information is needed to come to that conclusion.

Are the estimated return and monthly loan payment correlated?

ggplot(analysis_df, aes(x = MonthlyLoanPayment, y = EstimatedReturn)) +
    geom_point()

analysis_df %>%
    select(MonthlyLoanPayment, EstimatedReturn) %>%
    filter(!is.na(MonthlyLoanPayment) & !is.na(EstimatedReturn)) %>%
    summarize(corr_coeff = cor(MonthlyLoanPayment, EstimatedReturn))
##   corr_coeff
## 1 -0.2527313

The monthly loan payment and estimated return are negatively correlated though instead of going to zero the values seem to descend to 0.05, but there are stronger negative and postive values towards the left of the plot.

Are Borrower APR and Amount Delinquent Correlated?

ggplot(analysis_df, aes(x = LoanOriginalAmount, y = EstimatedReturn)) +
    geom_point()

analysis_df %>%
    select(EstimatedReturn, LoanOriginalAmount) %>%
    filter(!is.na(LoanOriginalAmount) & !is.na(EstimatedReturn)) %>%
    summarize(corr_coeff = cor(LoanOriginalAmount, EstimatedReturn))
##   corr_coeff
## 1 -0.2861175

Even though Estimated Return and the loan original amount are negatively correlated, the visual linear pattern is fairly unique. Most estimated return values descend only to just under 0.1, however at the beginning there is some noise for both negative & positive values.

Is the monthly loan payment and Debt to income ratio correlated?

ggplot(analysis_df, aes(x = BorrowerAPR, y = AmountDelinquent)) +
    geom_point()

analysis_df %>%
    select(BorrowerAPR, AmountDelinquent) %>%
    filter(!is.na(BorrowerAPR) & !is.na(AmountDelinquent)) %>%
    summarize(corr_coeff = cor(BorrowerAPR, AmountDelinquent))
##   corr_coeff
## 1 0.06567919

Even though there is a slight positive correlation between Borrower APR and Amount Delinquent, it is nearly 0 and makes sense given that most values are near the 0 line.

Which employment status has the longest time in deliquency?

ggplot(analysis_df %>% filter(!is.na(LoanCurrentDaysDelinquent)), aes(x = EmploymentStatus, y = LoanCurrentDaysDelinquent)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

analysis_df %>%
    select(EmploymentStatus, LoanCurrentDaysDelinquent) %>%
    filter(!is.na(LoanCurrentDaysDelinquent)) %>%
    group_by(EmploymentStatus) %>%
    summarize(
        max = max(LoanCurrentDaysDelinquent), med = median(LoanCurrentDaysDelinquent),
        avg = mean(LoanCurrentDaysDelinquent)
    )
## # A tibble: 9 x 4
##   EmploymentStatus   max   med   avg
##   <fct>            <dbl> <dbl> <dbl>
## 1 ""                2704     0 344. 
## 2 Employed          1222     0  26.9
## 3 Full-time         2497     0 378. 
## 4 Not available     2599     0 497. 
## 5 Not employed      2367     0 189. 
## 6 Other             1052     0  37.8
## 7 Part-time         2389     0 329. 
## 8 Retired           2394     0 382. 
## 9 Self-employed     2421     0 202.

The category with the highest maximum in delinquency is uncategorized and the highest average value is not available. This could be that the uncategorized employment status could be not available but more information is needed to make that conclusion.

Which loan status has the highest monthly payment?

ggplot(analysis_df, aes(x = LoanStatus, y = MonthlyLoanPayment)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

analysis_df %>%
    select(LoanStatus, MonthlyLoanPayment) %>%
    group_by(LoanStatus) %>%
    summarize(
        max = max(MonthlyLoanPayment), med = median(MonthlyLoanPayment),
        avg = mean(MonthlyLoanPayment)
    )
## # A tibble: 12 x 4
##    LoanStatus               max   med   avg
##    <fct>                  <dbl> <dbl> <dbl>
##  1 Cancelled               111.  39.1  61.5
##  2 Chargedoff             1553. 174.  235. 
##  3 Completed              2252. 171.  219. 
##  4 Current                1781. 290.  320. 
##  5 Defaulted              1103. 168.  233. 
##  6 FinalPaymentInProgress 2164. 218.  298. 
##  7 Past Due (>120 days)    578. 268.  281. 
##  8 Past Due (1-15 days)   1385. 238.  285. 
##  9 Past Due (16-30 days)  1278. 201.  277. 
## 10 Past Due (31-60 days)  1237. 220.  282. 
## 11 Past Due (61-90 days)   904. 201.  258. 
## 12 Past Due (91-120 days)  884. 205.  263.

The loan status with the highest max monthly payment is completed while the category with the highest median and average monthly payment is current. This is due to the extremity of the outliers for the completed category, which could be accounted by people paying off their loan balances.

Which income range has the highest monthly payment?

ggplot(analysis_df, aes(x = IncomeRange, y = MonthlyLoanPayment)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

analysis_df %>%
    select(IncomeRange, MonthlyLoanPayment) %>%
    group_by(IncomeRange) %>%
    summarize(
        max = max(MonthlyLoanPayment), med = median(MonthlyLoanPayment),
        avg = mean(MonthlyLoanPayment)
    )
## # A tibble: 8 x 4
##   IncomeRange      max   med   avg
##   <fct>          <dbl> <dbl> <dbl>
## 1 $0             1131.  170.  267.
## 2 $1-24,999      1048.  134.  155.
## 3 $100,000+      2252.  375.  412.
## 4 $25,000-49,999 1382.  174.  210.
## 5 $50,000-74,999 1778.  253.  280.
## 6 $75,000-99,999 2112.  302.  329.
## 7 Not displayed  1048.  122.  182.
## 8 Not employed   1086.  170.  184.

The income range with the highest monthly payment is the 100K+ range which makes sense given that people with larger incomes can afford larger loans.

Which loan status has the highest rate of return?

ggplot(analysis_df %>% filter(!is.na(EstimatedReturn)), aes(x = LoanStatus, y = EstimatedReturn)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

analysis_df %>%
    select(LoanStatus, EstimatedReturn) %>%
    filter(!is.na(EstimatedReturn)) %>%
    group_by(LoanStatus) %>%
    summarize(
        max = max(EstimatedReturn), med = median(EstimatedReturn),
        avg = mean(EstimatedReturn)
    )
## # A tibble: 11 x 4
##    LoanStatus               max    med    avg
##    <fct>                  <dbl>  <dbl>  <dbl>
##  1 Chargedoff             0.284 0.125  0.123 
##  2 Completed              0.267 0.107  0.102 
##  3 Current                0.176 0.0863 0.0902
##  4 Defaulted              0.254 0.127  0.123 
##  5 FinalPaymentInProgress 0.170 0.0922 0.0933
##  6 Past Due (>120 days)   0.149 0.120  0.119 
##  7 Past Due (1-15 days)   0.176 0.111  0.111 
##  8 Past Due (16-30 days)  0.159 0.112  0.111 
##  9 Past Due (31-60 days)  0.173 0.112  0.111 
## 10 Past Due (61-90 days)  0.170 0.114  0.114 
## 11 Past Due (91-120 days) 0.176 0.115  0.115

The Loan Status with the maximum rate of return is Charged off, but the status with the highest median and average rate of returns is Defaulted. This is due to large outliers in the charged off category.

Which loan status has the highest APR?

ggplot(analysis_df, aes(x = LoanStatus, y = BorrowerAPR)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

analysis_df %>%
    select(LoanStatus, BorrowerAPR) %>%
    filter(!is.na(BorrowerAPR)) %>%
    group_by(LoanStatus) %>%
    summarize(
        max = max(BorrowerAPR), med = median(BorrowerAPR),
        avg = mean(BorrowerAPR)
    )
## # A tibble: 12 x 4
##    LoanStatus               max   med   avg
##    <fct>                  <dbl> <dbl> <dbl>
##  1 Cancelled              0.256 0.207 0.206
##  2 Chargedoff             0.462 0.263 0.258
##  3 Completed              0.512 0.195 0.209
##  4 Current                0.358 0.205 0.214
##  5 Defaulted              0.506 0.240 0.239
##  6 FinalPaymentInProgress 0.358 0.224 0.230
##  7 Past Due (>120 days)   0.358 0.287 0.284
##  8 Past Due (1-15 days)   0.381 0.267 0.264
##  9 Past Due (16-30 days)  0.358 0.273 0.268
## 10 Past Due (31-60 days)  0.358 0.273 0.266
## 11 Past Due (61-90 days)  0.358 0.275 0.273
## 12 Past Due (91-120 days) 0.385 0.282 0.271

The loan status with the highest maximum APR is completed, however the category with highest median and average value Past Due (>120 days). This is due to the extremes of the completed category being more that the Past Due (>120 days).

Which income range had the highest debt to income ratio?

ggplot(analysis_df %>% filter(!is.na(DebtToIncomeRatio)), aes(x = IncomeRange, y = DebtToIncomeRatio)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

analysis_df %>%
    select(IncomeRange, DebtToIncomeRatio) %>%
    filter(!is.na(DebtToIncomeRatio)) %>%
    group_by(IncomeRange) %>%
    summarize(
        max = max(DebtToIncomeRatio), med = median(DebtToIncomeRatio),
        avg = mean(DebtToIncomeRatio)
    )
## # A tibble: 7 x 4
##   IncomeRange      max   med   avg
##   <fct>          <dbl> <dbl> <dbl>
## 1 $1-24,999      10.0  0.32  0.737
## 2 $100,000+      10.0  0.17  0.181
## 3 $25,000-49,999  7.9  0.26  0.279
## 4 $50,000-74,999 10.0  0.23  0.246
## 5 $75,000-99,999  2.55 0.2   0.214
## 6 Not displayed  10.0  0.16  0.297
## 7 Not employed   10.0  0.295 3.33

All but two of the income range categories reach the maximum debt to income ratio value, which is due to the ceiling placed on the maximum value. So the median and average values are more meaningful. The category with the highest median values is 1-24,99, and the category with the highest average debt to income ratio is Not employed. Which makes sense given the lower income brackets.

Which employment status had the highest rate of return?

ggplot(analysis_df %>% filter(!is.na(EstimatedReturn)), aes(x = EmploymentStatus, y = EstimatedReturn)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

analysis_df %>%
    select(EmploymentStatus, EstimatedReturn) %>%
    filter(!is.na(EstimatedReturn)) %>%
    group_by(EmploymentStatus) %>%
    summarize(
        max = max(EstimatedReturn), med = median(EstimatedReturn),
        avg = mean(EstimatedReturn)
    )
## # A tibble: 7 x 4
##   EmploymentStatus   max    med    avg
##   <fct>            <dbl>  <dbl>  <dbl>
## 1 Employed         0.224 0.0905 0.0947
## 2 Full-time        0.267 0.102  0.103 
## 3 Not employed     0.226 0.122  0.119 
## 4 Other            0.176 0.099  0.0991
## 5 Part-time        0.184 0.110  0.106 
## 6 Retired          0.203 0.112  0.106 
## 7 Self-employed    0.284 0.0936 0.0973

The employment status with the maximum rate of return is Self-Employed, while the highest median and average values is from not employed. This is due to the extremity of the outliers of self-employed, which also has a large number of outliers as well.

Which employment status had the highest monthly payment?

ggplot(analysis_df, aes(x = EmploymentStatus, y = MonthlyLoanPayment)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

analysis_df %>%
    select(EmploymentStatus, MonthlyLoanPayment) %>%
    group_by(EmploymentStatus) %>%
    summarize(
        max = max(MonthlyLoanPayment), med = median(MonthlyLoanPayment),
        avg = mean(MonthlyLoanPayment)
    )
## # A tibble: 9 x 4
##   EmploymentStatus   max   med   avg
##   <fct>            <dbl> <dbl> <dbl>
## 1 ""               1041.  116.  159.
## 2 Employed         2252.  276.  310.
## 3 Full-time        1563.  163.  217.
## 4 Not available    1048.  126.  191.
## 5 Not employed     1086.  169.  183.
## 6 Other            1035.  173.  232.
## 7 Part-time         984.  106.  141.
## 8 Retired          1046.  131.  168.
## 9 Self-employed    1380.  238.  283.

The employment status with the highest maximum, median, and average monthly payment is Employed. Which makes sense given that category represents clients that is more likely to make regular monthly payments.

Which loan status had the highest loss?

ggplot(analysis_df %>% filter(!is.na(EstimatedLoss)), aes(x = LoanStatus, y = EstimatedLoss)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

analysis_df %>%
    select(LoanStatus, EstimatedLoss) %>%
    filter(!is.na(EstimatedLoss)) %>%
    group_by(LoanStatus) %>%
    summarize(
        max = max(EstimatedLoss), med = median(EstimatedLoss),
        avg = mean(EstimatedLoss)
    )
## # A tibble: 11 x 4
##    LoanStatus               max    med    avg
##    <fct>                  <dbl>  <dbl>  <dbl>
##  1 Chargedoff             0.366 0.112  0.116 
##  2 Completed              0.366 0.085  0.0868
##  3 Current                0.203 0.0649 0.0734
##  4 Defaulted              0.366 0.112  0.112 
##  5 FinalPaymentInProgress 0.183 0.0774 0.0833
##  6 Past Due (>120 days)   0.165 0.106  0.109 
##  7 Past Due (1-15 days)   0.183 0.0975 0.0980
##  8 Past Due (16-30 days)  0.195 0.0975 0.101 
##  9 Past Due (31-60 days)  0.203 0.099  0.0996
## 10 Past Due (61-90 days)  0.183 0.0975 0.103 
## 11 Past Due (91-120 days) 0.168 0.102  0.101

There are three loan statuses with the maximum estimated loss, which are charged off, completed, and defaulted. Two categories are also tied for the highest median value with are charged off and defaulted. Though charged off has the sole highest average value.

Final Plots

The following are highlights from the visualizations used in this assignment and what was learned during their creation.

ggpairs(hist_df)

In this scatter matrix not only are there several variables that are strongly correlated with each other, but that there were also interesting scatterplot patterns that were worth examining further.

ggplot(analysis_df, aes(x = MonthlyLoanPayment, y = EstimatedReturn)) +
    geom_point()

In this scatterplot of Estimated Return vs. Monthly Loan Payment instead of descending to the lowest value, the data points seem to gather around 0.05 horizontally and are more noisy towards 0. This is a unique data point pattern which lead to a negative correlation.

ggplot(analysis_df, aes(x = LoanStatus, y = MonthlyLoanPayment)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

For the boxplot summaries, it showed that meaningful information can be found by grouping data together and that NA data needed to be removed in order to reduce the amount of errors present in the final visualizations.

Reflection

The Prosper loan dataset was a large rich dataset with decent amount of variables that describe the the loans taken out on their platform. By taking a subset of interesting variables and examining them via data visualizations and summary statistics interesting patterns and observations could be made from the data.